feat: support attribute insertion in OTAP transform #1737
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1737      +/-   ##
==========================================
+ Coverage   84.64%   84.72%   +0.08%
==========================================
  Files         502      502
  Lines      148560   149864    +1304
==========================================
+ Hits       125745   126975    +1230
- Misses      22281    22355      +74
  Partials      534      534
```
```rust
/// Helper to extract string key from a record batch key column at a given index.
/// Handles both plain Utf8 and Dictionary-encoded columns.
fn get_key_at_index(key_col: &ArrayRef, idx: usize) -> Option<String> {
```
There's a helper type that can encapsulate this logic: `otap_df_pdata::arrays::StringArrayAccessor`

```rust
pub(crate) type StringArrayAccessor<'a> = MaybeDictArrayAccessor<'a, StringArray>;
```

Instead of having a method defined here specially for this, below in `create_inserted_batch` you could do:

```rust
let key_col = current_batch
    .column_by_name(consts::ATTRIBUTE_KEY)
    .map(StringArrayAccessor::try_new)
    .transpose()?;
```

Then you could probably return an error if the `key_col` ends up being `None`.

This type provides the `str_at` method, which returns an `Option<&str>`.
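For illustration, a minimal sketch of how the accessor might then be used when checking which keys already exist (only `try_new` and `str_at` come from the comment above; the surrounding loop and names are assumptions):

```rust
use otap_df_pdata::arrays::StringArrayAccessor;

let key_col = current_batch
    .column_by_name(consts::ATTRIBUTE_KEY)
    .map(StringArrayAccessor::try_new)
    .transpose()?;

// `str_at` transparently handles plain Utf8 and dictionary-encoded key columns.
if let Some(keys) = &key_col {
    for idx in 0..current_batch.num_rows() {
        if let Some(key) = keys.str_at(idx) {
            // ... compare `key` against the attributes being inserted ...
        }
    }
}
```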
```rust
})?;
let key_type = schema.field(key_col_idx).data_type();

let new_keys: ArrayRef = match key_type {
```
Instead of using the regular arrow builders and having to write these match statements to append values to arrays that could be dict/native type, we have a set of helper types that can encapsulate this logic.

For example, `StringArrayBuilder`:

otel-arrow/rust/otap-dataflow/crates/pdata/src/encode/record/array.rs, lines 650 to 656 in 2c3976c:

```rust
pub type StringArrayBuilder = AdaptiveArrayBuilder<
    String,
    NoArgs,
    StringBuilder,
    StringDictionaryBuilder<UInt8Type>,
    StringDictionaryBuilder<UInt16Type>,
>;
```

This type also exposes an `append_str_n` method which can be used to append the same string multiple times. This is usually faster than appending the values one at a time. So an optimization we could make here, if we're appending the same key multiple times, would be to use this method.

There are similar builders for the other types we're inserting below as values (int, double, bool).
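As a very rough sketch of how the key column could be built with the helper (the constructor and `finish` calls below are placeholders rather than the real API; only `append_str_n` is taken from the comment above):

```rust
// Hypothetical sketch: check crates/pdata/src/encode/record/array.rs for the
// real constructor/finalization methods of StringArrayBuilder.
let mut key_builder = StringArrayBuilder::new(/* dictionary options */);

// Bulk-append the inserted key once per new row; a single call is usually
// faster than appending in a per-row loop.
key_builder.append_str_n(&inserted_key, rows_to_insert);

let new_keys: ArrayRef = key_builder.finish();
```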
```rust
})?;

// Build a set of (parent_id, key) pairs that already exist
let mut existing_keys: BTreeMap<u16, BTreeSet<String>> = BTreeMap::new();
```
I'm wondering if there are any ways we could optimize building up the set of attributes we're going to insert.

For example, using a `BTreeMap` here, and a `BTreeSet` below for `unique_parents`, means that we'll be looking up every parent_id multiple times. It might be faster to use a `RoaringBitmap` for the `unique_parent_ids`, and maybe we could have a `RoaringBitmap` for each insert entry corresponding to whether the row with some parent_id contains the attribute.
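A rough sketch of that idea, assuming the `roaring` crate; `inserts`, `keys`, `insert.key`, and the u16 parent ID type are stand-ins for whatever the PR actually uses:

```rust
use roaring::RoaringBitmap;

// One bitmap for all parent_ids seen, plus one bitmap per attribute we plan to
// insert, marking which parent_ids already carry that key.
let mut unique_parent_ids = RoaringBitmap::new();
let mut has_key: Vec<RoaringBitmap> = vec![RoaringBitmap::new(); inserts.len()];

for idx in 0..parent_ids_arr.len() {
    // `value(idx)` yields u16 in the current code; widen to the bitmap's u32.
    let parent_id = u32::from(parent_ids_arr.value(idx));
    let _ = unique_parent_ids.insert(parent_id);
    if let Some(key) = keys.str_at(idx) {
        for (i, insert) in inserts.iter().enumerate() {
            // `insert.key` is a placeholder for the configured attribute key.
            if insert.key == key {
                let _ = has_key[i].insert(parent_id);
            }
        }
    }
}
// Rows to add for insert `i`: parent_ids in `unique_parent_ids` but not in `has_key[i]`.
```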
```rust
let parent_ids_arr = parent_ids
    .as_any()
    .downcast_ref::<PrimitiveArray<UInt16Type>>()
    .ok_or_else(|| Error::ColumnDataTypeMismatch {
        name: consts::PARENT_ID.into(),
        expect: DataType::UInt16,
        actual: parent_ids.data_type().clone(),
    })?;
```
The parent_ids won't always be u16. For example, for metrics data point attributes and attributes for span links and span events, the parent ID type can be u32. Also, for u32 IDs, the `parent_id_arr` can be dictionary encoded (e.g. it's not always a `PrimitiveArray`).

To handle this, we might need to make the function generic over `T` where:

```rust
T: ParentId,
<T as ParentId>::ArrayType: ArrowPrimitiveType,
```

(You'll see we do something similar in this file for `materialize_parent_id_for_attributes`.)

Then we can get the parent_ids as:

```diff
-let parent_ids_arr = parent_ids
-    .as_any()
-    .downcast_ref::<PrimitiveArray<UInt16Type>>()
-    .ok_or_else(|| Error::ColumnDataTypeMismatch {
-        name: consts::PARENT_ID.into(),
-        expect: DataType::UInt16,
-        actual: parent_ids.data_type().clone(),
-    })?;
+let parent_ids_arr = MaybeDictArrayAccessor::<PrimitiveArray<T::ArrayType>>::try_new(
+    get_required_array(record_batch, consts::PARENT_ID)?,
+)?;
```

To ensure we handle u32 parent IDs correctly, it probably also makes sense to add a test for this somewhere.
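Putting those pieces together, a hedged sketch of the generic shape (the real parameter list and return type of `create_inserted_batch` in the PR may differ; only the bounds and the accessor call come from the suggestion above):

```rust
// Sketch only: the parameter list and return type here are illustrative.
fn create_inserted_batch<T>(record_batch: &RecordBatch /* , ... */) -> Result<RecordBatch, Error>
where
    T: ParentId,
    <T as ParentId>::ArrayType: ArrowPrimitiveType,
{
    // Works for u16 and u32 parent IDs, including dictionary-encoded u32 columns.
    let parent_ids_arr = MaybeDictArrayAccessor::<PrimitiveArray<T::ArrayType>>::try_new(
        get_required_array(record_batch, consts::PARENT_ID)?,
    )?;
    // ... build the rows to insert using `parent_ids_arr` ...
    todo!()
}
```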
```rust
// We collect columns into a map or vec matching schema order.
let mut columns = Vec::with_capacity(schema.fields().len());

for field in schema.fields() {
```
There's a case that the logic here might not handle correctly, which is if we're inserting a type of attribute and the original schema did not previously contain any attributes of this type.

All the `ATTRIBUTE_*` columns are optional in OTAP. For example, if some attribute `RecordBatch` had no values of type int, the `ATTRIBUTE_INT` column would be omitted. If we encountered such a record batch and we were inserting an integer attribute, it would not be included in the original batch with this logic.

We should add a test for this and handle it. We might need to write some custom batch concatenation logic for this, rather than relying on arrow's `concat_batches` compute kernel, or make some modifications to the original batch before we invoke this function.
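To illustrate one way the missing-column case could be handled before concatenation (a sketch only: the `ATTRIBUTE_INT` constant, the `Int64` value type, and reuse of the `rb`/`Error::Format` names from the snippets in this PR are assumptions):

```rust
use std::sync::Arc;
use arrow::array::new_null_array;
use arrow::datatypes::{DataType, Field, Schema};

// Append an all-null column of the value type being inserted to the original
// batch, so its schema matches the batch of new rows before concatenation.
let missing_field = Field::new(consts::ATTRIBUTE_INT, DataType::Int64, true);

let mut fields: Vec<Field> = rb
    .schema()
    .fields()
    .iter()
    .map(|f| f.as_ref().clone())
    .collect();
fields.push(missing_field.clone());

let mut columns = rb.columns().to_vec();
columns.push(new_null_array(missing_field.data_type(), rb.num_rows()));

let extended = RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
    .map_err(|e| Error::Format { error: e.to_string() })?;
```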
I think this has been fixed. The necessary metadata is added to the schema in `extend_schema_for_inserts` before the insertion.
```rust
let combined = arrow::compute::concat_batches(&rb.schema(), &[rb, new_rows])
    .map_err(|e| Error::Format {
        error: e.to_string(),
    })?;
```
I mentioned in https://github.com/open-telemetry/otel-arrow/pull/1737/files#r2704902249 that we might need to be careful about how we invoke this `concat_batches` function due to the case where we're inserting a column that was not contained in the original batch.

There's another case we probably need to handle as well, which is: if the column is dictionary encoded and inserting the new value would cause the dictionary to overflow, then we need to either expand the key type (e.g. convert from a `Dict<UInt8>` to a `Dict<UInt16>`) or convert from a dictionary to a non-dictionary-encoded array.
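For the overflow case, arrow's `cast` compute kernel can do both conversions; a sketch, with `col` standing in for the overflowing dictionary-encoded string column:

```rust
use arrow::compute::cast;
use arrow::datatypes::DataType;

// Widen the dictionary key type (UInt8 -> UInt16 keys) ...
let widened = cast(
    &col,
    &DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
)
.map_err(|e| Error::Format { error: e.to_string() })?;

// ... or fall back to a plain, non-dictionary array.
let plain = cast(&col, &DataType::Utf8)
    .map_err(|e| Error::Format { error: e.to_string() })?;
```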
albertlockett left a comment
@ThomsonTan thanks for taking this! I'm really excited to see this desperately needed feature getting implemented and this looks like a great start!
I left a few comments, mostly around where we could use different types to simplify the code, some edge cases in the OTAP protocol and suggestions for optimizations.
WRT optimization, I have a few general comments:

- We could probably do them in a follow-up PR after adding new benchmarks to the existing suite in `benchmarks/benches/attribute_transform/main.rs`. Having benchmarks will give us confidence that the optimizations we're adding are actually effective.
- Another optimization we could consider in the future would be: currently we're materializing the `RecordBatch` for the rename & deletes, then materializing another for the inserts, and concatenating them together. This means that for each column, we create two `Arc<dyn Array>`, and afterwards discard them while concatenating them into a new `Arc<dyn Array>` for the final result. We might be able to avoid this by:
  a) inserting the new keys while we're doing the rename/delete
  b) inserting the new values while taking the values/parent_id columns
I realize that this makes the implementation significantly more complex, so it's fine if we want to just document this as a future optimization. The only reason I'm calling it out ahead of time is that some of the code we write to handle OTAP edge cases (see https://github.com/open-telemetry/otel-arrow/pull/1737/files#r2704924006) would be different with this optimization in place.
```diff
-let should_materialize_parent_ids =
-    any_rows_deleted && schema.column_with_name(consts::PARENT_ID).is_some();
+let should_materialize_parent_ids = (any_rows_deleted || insert_needed)
+    && schema.column_with_name(consts::PARENT_ID).is_some();
```
Co-authored-by: albertlockett <[email protected]>
Fix #1035